data parallelism
SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training
Data parallelism across multiple machines is widely adopted for accelerating distributed deep learning, but it is hard to achieve linear speedup due to heavy communication overhead. In this paper, we propose SAPipe, a performant system that pushes the training speed of data parallelism to its fullest extent.
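The communication step this abstract refers to is the gradient all-reduce at the heart of synchronous data parallelism. As a minimal sketch (NumPy, with the all-reduce simulated in-process rather than over a network), each worker computes a gradient on its own data shard, the gradients are averaged, and every replica applies the identical update:

```python
import numpy as np

def allreduce_mean(grads):
    """Average gradients across workers: the all-reduce step whose
    communication cost dominates at scale."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step on a least-squares loss.
    Each worker computes a gradient on its local shard; the averaged
    gradient is then applied identically on every replica."""
    local_grads = []
    for X, y in shards:
        err = X @ w - y
        local_grads.append(2 * X.T @ err / len(y))  # local gradient
    g = allreduce_mean(local_grads)                 # "communication" step
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
# Split the batch evenly across 4 simulated workers.
shards = list(zip(np.split(X, 4), np.split(y, 4)))

w = np.zeros(4)
for _ in range(200):
    w = data_parallel_step(w, shards)
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the replicas stay in lockstep; the point of systems like SAPipe is to hide or reduce the cost of that averaging step.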
ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training
Ding, Yuran, Chen, Xinwei, Zhang, Xiaofan, Zhou, Zongwei
Optimizing large-language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. It is a multi-agent system featuring Coordinator, Analyzer, and Proposal agents, integrating LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design automates the diagnosis of performance bottlenecks and recommends optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments show that ASAP-generated sharding configurations reduce training step time by up to 28% and improve throughput by up to 1.43 times. When combined with additional optimization from human experts, throughput can be further increased to 2.58 times. ASAP promises to provide a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training.
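The roofline analysis mentioned in this abstract bounds a kernel's attainable throughput by the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, using hypothetical accelerator numbers (not any specific hardware):

```python
def roofline_bound(peak_flops, mem_bw, arithmetic_intensity):
    """Attainable performance (FLOP/s) under the roofline model:
    min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bw * arithmetic_intensity)

# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK, BW = 300e12, 2e12
ridge = PEAK / BW  # FLOPs per byte needed to become compute-bound

# A kernel at 10 FLOP/byte is memory-bound; at 500 FLOP/byte, compute-bound.
mem_bound = roofline_bound(PEAK, BW, 10)
comp_bound = roofline_bound(PEAK, BW, 500)
```

Comparing a kernel's measured FLOP/s against this bound is one way an analyzer agent can classify a bottleneck as compute- or memory-limited before proposing a sharding change.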
Scaling Performance of Large Language Model Pretraining
Interrante-Grant, Alexander, Varela-Rosa, Carla, Narayan, Suhaas, Connelly, Chris, Reuther, Albert
Training large language models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, very little information about the scaling performance and training considerations of these large training pipelines is released publicly. Working with very large datasets and models can be complex, and practical recommendations for tuning training performance when scaling up large language models are scarce in the public literature. In this paper, we aim to demystify the large language model pretraining pipeline somewhat - in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity. Index Terms--large language models, distributed training, data parallelism.
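A common back-of-the-envelope for the scaling behavior this abstract studies: with a fixed per-GPU batch (weak scaling) and an unoverlapped gradient all-reduce, efficiency is compute time over compute-plus-communication time. The sketch below uses the standard ring all-reduce cost model and hypothetical numbers (1 GB of gradients, 100 GB/s bandwidth, 0.5 s of compute per step), not measurements from this paper:

```python
def ring_allreduce_time(model_bytes, bandwidth, n_gpus):
    """Communication time of a ring all-reduce: each GPU transfers
    2*(N-1)/N times the gradient size."""
    if n_gpus < 2:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * model_bytes / bandwidth

def weak_scaling_efficiency(t_compute, model_bytes, bandwidth, n_gpus):
    """Fraction of ideal throughput when per-GPU batch is fixed and the
    gradient all-reduce is not overlapped with compute."""
    t_comm = ring_allreduce_time(model_bytes, bandwidth, n_gpus)
    return t_compute / (t_compute + t_comm)

# Hypothetical: 1 GB of gradients, 100 GB/s interconnect, 0.5 s compute.
eff_8 = weak_scaling_efficiency(0.5, 1e9, 100e9, 8)
```

Overlapping communication with the backward pass, as most frameworks do, pushes the effective efficiency above this unoverlapped estimate.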
End-to-end RL Improves Dexterous Grasping Policies
Singh, Ritvik, Van Wyk, Karl, Abbeel, Pieter, Malik, Jitendra, Ratliff, Nathan, Handa, Ankur
This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is far more memory-intensive, resulting in relatively low batch sizes, which is not amenable to algorithms like PPO. Nevertheless, it remains an attractive method: unlike the more commonly used techniques that distill state-based policies into vision networks, end-to-end RL can allow for emergent active vision behaviors. We identify that a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs using traditional data parallelism techniques. We propose a new method that disaggregates the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, the simulator runs on three of them and PPO runs on the fourth. We show that with the same number of GPUs, we can double the number of environments compared to the previous baseline of standard data parallelism. This allows us to train vision-based environments end-to-end with depth, which previously performed far worse under the baseline. We train and distill both depth and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and reality. This improvement is likely due to the observability gap between state and vision policies, which does not exist when distilling depth policies into stereo RGB. We further show that the increased batch size brought about by disaggregated simulation also improves real-world performance. When deploying in the real world, we improve upon the previous state-of-the-art vision-based results using our end-to-end policies.
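The disaggregated layout described here (three GPUs simulating, one GPU learning) is structurally a producer-consumer pipeline. A minimal sketch of that structure with threads and a shared queue standing in for GPU processes and an experience buffer (all names here are illustrative, not the paper's implementation):

```python
import queue
import threading

def simulator(worker_id, out_q, n_steps):
    """Stand-in for one simulator GPU: produces transitions into the
    shared experience queue."""
    for t in range(n_steps):
        out_q.put((worker_id, t))

def learner(in_q, batch_size, n_batches, batches):
    """Stand-in for the PPO learner on its own GPU: consumes fixed-size
    batches from the queue, decoupled from simulation throughput."""
    for _ in range(n_batches):
        batches.append([in_q.get() for _ in range(batch_size)])

# 3 simulator workers feed 1 learner, mirroring the 3-sim/1-learner split.
q = queue.Queue()
batches = []
sims = [threading.Thread(target=simulator, args=(i, q, 8)) for i in range(3)]
trainer = threading.Thread(target=learner, args=(q, 6, 4, batches))
for t in sims + [trainer]:
    t.start()
for t in sims + [trainer]:
    t.join()
```

The design point is that simulator and learner capacity are sized independently: freeing the learner's GPU memory from simulation is what allows the larger environment counts and batch sizes the abstract reports.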